In this project, the goal is to use a sample of publicly available customer data from telecommunications provider Telco to: 1) give an overview of their customer base; and 2) apply customer segmentation using the k-prototype clustering algorithm - an extension of k-means clustering that can deal with non-numeric data.
The analysis generated four clusters that have been labled as follows after exploration:
library(dplyr)
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(clustMixType)
The data for this project is a sample of Telco’s - a telecommunications operator - customer database made publicly available on Kaggle.
Each row represents a customer, and each column contains customer’s attributes.The data set includes information about:
#creating folder for data
if(!file.exists('project_data')) {
dir.create('project_data')
}
#importing data
data = read.csv('/Users/raqdom/DS_Projects/Clustering_Customers/WA_Fn-UseC_-Telco-Customer-Churn.csv', na.strings=c("", "NA"))
The data has 21 columns/customer variables and 7,043 rows/customers.
dim(data)
## [1] 7043 21
head(data)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
The first 5 rows look as follows:
head(data)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
The bulk of variables are categorical, except for tenure, monthly charges, total charges, and SerniorCitizen (which looks like a factor).
str(data)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
In total, only the column ‘TotalCharges’ has missing information on 11 customers.
NA count per column
missing = sort(sapply(data, function(x) sum(is.na(x))))
missing[missing>0]
## TotalCharges
## 11
A quick investigation of the data suggests that ‘TotalCharges’ is a function of ‘MonthlyCharges’ multiplied by ‘tenure’ (in months). For those missing values, tenure = 0 as presumably these are new customers that have not been charged yet. We can use this to replace NAs with zero when we clean the data.
Total Charges
qplot(data$MonthlyCharges*data$tenure, data$TotalCharges, xlab='Monthly Charges x Tenure', ylab='Total Charges') #total charges = monthly charges * tenure but NA if tenure = 0
We will replace any NAs in ‘TotalCharges’ with zeros as described above.
data2 = data %>% mutate(TotalCharges = ifelse(is.na(TotalCharges), 0, TotalCharges))
Where more than two categories can be grouped for one variable, we will group them (i.e. for variable ‘MultipleLines’, we will group ‘No’ and ‘No phone service’) as follows:
data2 = data2 %>%
mutate(SeniorCitizen = as.factor(ifelse(SeniorCitizen == 0, 'No', 'Yes')),
MultipleLines = as.factor(ifelse(MultipleLines %in% c('No phone service', 'No'), 'No', 'Yes')),
OnlineSecurity = as.factor(ifelse(OnlineSecurity %in% c('No internet service', 'No'), 'No', 'Yes')),
OnlineBackup = as.factor(ifelse(OnlineBackup %in% c('No internet service', 'No'), 'No', 'Yes')),
DeviceProtection = as.factor(ifelse(DeviceProtection %in% c('No internet service', 'No'), 'No', 'Yes')),
TechSupport = as.factor(ifelse(TechSupport %in% c('No internet service', 'No'), 'No', 'Yes')),
StreamingTV = as.factor(ifelse(StreamingTV %in% c('No internet service', 'No'), 'No', 'Yes')),
StreamingMovies = as.factor(ifelse(StreamingMovies %in% c('No internet service', 'No'), 'No', 'Yes')))
The following charts summarise the customer dataset. A quick overview suggests:
An ideal customer here will be defined in lifetime value terms: someone whose average spend is high (measured by monthly charges); and whose lifetime with the company is long (tenure).
The assumption is that Telco would want to maximise customer lifetime value, and wants to understand its drivers so it can improve it (i.e.: through marketing/service offers).
To visualise this, we will plot each customer (represented by the dots) alongside monthly charges and tenure, and overlay each variable to bring out any pattern. For clearer visualisation, the median tenure/monthly charges are added.
In theory, Telco would want every customer to be in the top right quadrant (high charges/high loyalty), and might deploy campaigns designed to move customers from one quadrant to the other.
Clustering with customer data - or customer segmentation - is used to group individuals so that customers in the same group or cluster are more similar to each other than those in other groups/clusters.
K-means is a simple unsupervised machine learning algorithm that groups data into a specified number (k) of clusters. It searches for a a partition of the data into k clusters that minimises the within groups sum of squared errors - this is the sum of the squared differences between each observation and its group’s mean, and can be used as a measure of variation within a cluster. If all cases within a cluster are identical the sum of squared errors would then be equal to 0.
However, k-means only works on numeric data. Since here we have factor variables, we will use the k-prototypes algorithm, which is an extension of k-means to deal with categorical data. More info available in Zhexue Huang’s paper.
Before applying clustering to the data, let’s standardise numeric variables so they’re on the same scale and get rid of the customerID column, which isn’t useful in clustering.
data_std = data2 %>%
mutate_if(is.numeric, scale) %>% #scaling numeric variables
select(-customerID) #taking out customerID so it's not used in clustering
This algorithm requires us to input the number of clusters we want to group our customers into. To guide this decision, we will save down the values of the within groups sum of squared errors (SSE) against the number of clusters (1 to 10).
Known as the ‘elbow’ method, the idea is that the chart (tends to look like an arm) will show us where we would start to have diminishing returns. We want to choose a small value of k that still has low SSE.
From looking at this chart, we will stick to grouping customers into 4 clusters.
set.seed(123)
wss = sapply(1:10, function(k){kproto(data_std, k, verbose = FALSE)$tot.withinss})
# Total within-cluster sum of squares vs Choice of k
qplot(1:10, wss, geom=c('point', 'line'),xlab='Clusters', ylab='Within-cluster sum of squares')
Now that we know we want four clusters, we can run the k-prototypes algorithm on the customer data.
set.seed(123)
data_k4 <- kproto(data_std, 4)
This has grouped customers into four clusters, with the following number of customers in each:
data_k4$size
## clusters
## 1 2 3 4
## 1462 2810 1907 864
We can now explore the properties of each cluster in relation to our variables, and start painting a picture of who these groups could describe so they could be used strategically by the business.
Having explore the profiles (plots below), they could be summarised as follows:
The following show the % of customers by cluster across our variables.